(57) ABSTRACT

A data processing system, circuit arrangement, integrated
circuit device, program product, and method utilize source
identification information to selectively route data to different
memory sources in a shared memory system. This
permits, for example, data to be routed to only a portion of
the memory sources associated with a given requester,
thereby reducing the bandwidth to other memory sources
and reducing overall latencies within the system. Among
other possible information, the source identification information
may include an identification of which memory
source and/or which level of memory is providing the
requested data, and/or an indication of what processor/
requester and/or what type of instruction last modified the
requested data.

27 Claims, 3 Drawing Sheets 
SELECTIVE ROUTING OF DATA IN A MULTI-LEVEL MEMORY ARCHITECTURE
BASED ON SOURCE IDENTIFICATION INFORMATION

FIELD OF THE INVENTION

The invention is generally related to data processing systems and
processors therefor, and in particular to retrieval of data from a
multi-level memory architecture.

BACKGROUND OF THE INVENTION

Computer technology continues to advance at a remarkable pace, with
numerous improvements being made to the performance of both
microprocessors, the "brains" of a computer, and the memory that
stores the information processed by a computer.

In general, a microprocessor operates by executing a sequence of
instructions that form a computer program. The instructions are
typically stored in a memory system having a plurality of storage
locations identified by unique memory addresses. The memory addresses
collectively define a "memory address space," representing the
addressable range of memory addresses that can be accessed by a
microprocessor.

Both the instructions forming a computer program and the data
operated upon by those instructions are often stored in a memory
system and retrieved as necessary by the microprocessor when
executing the computer program. The speed of microprocessors,
however, has increased relative to that of memory devices to the
extent that retrieving instructions and data from a memory can often
become a significant bottleneck on performance. To decrease this
bottleneck, it is desirable to use the fastest available memory
devices possible. However, both memory speed and memory capacity are
typically directly related to cost, and as a result, many computer
designs must balance memory speed and capacity with cost.

A predominant manner of obtaining such a balance is to use multiple
"levels" of memories in a memory architecture to attempt to decrease
costs with minimal impact on system performance. Often, a computer
relies on a relatively large, slow and inexpensive mass storage
system such as a hard disk drive or other external storage device, an
intermediate main memory that uses dynamic random access memory
devices (DRAM's) or other volatile memory storage devices, and one or
more high speed, limited capacity cache memories, or caches,
implemented with static random access memory devices (SRAM's) or the
like. In some instances, instructions and data are stored in separate
instruction and data cache memories to permit instructions and data
to be accessed in parallel. One or more memory controllers are then
used to swap the information from segments of memory addresses, often
known as "cache lines", between the various memory levels to attempt
to maximize the frequency that requested memory addresses are stored
in the fastest cache memory accessible by the microprocessor.
Whenever a memory access request attempts to access a memory address
that is not cached in a cache memory, a "cache miss" occurs. As a
result of a cache miss, the cache line for a memory address typically
must be retrieved from a relatively slow, lower level memory, often
with a significant degradation in performance.

Another manner of increasing computer performance is to use multiple
microprocessors operating in parallel with one another to perform
different tasks at the same time. Often, the multiple microprocessors
share at least a portion of the same memory system to permit the
microprocessors to work together to perform more complex tasks. The
multiple microprocessors are typically coupled to one another and to
the shared memory by a system bus or other like interconnection
network. By sharing the same memory system, however, a concern arises
as to maintaining "coherence" between the various memory sources in
the shared memory system.

For example, in a typical multi-processor environment, each
microprocessor may have one or more dedicated cache memories that are
accessible only by that microprocessor, e.g., a level one (L1) data
and/or instruction cache, a level two (L2) cache, and/or one or more
buffers such as a line fill buffer and/or a transition buffer.
Moreover, more than one microprocessor may share certain caches and
other memories as well. As a result, any given memory address may be
stored from time to time in any number of memory sources in the
shared memory system.

Coherence is typically maintained via a central directory or via a
distributed mechanism known as "snooping", whereby each memory source
maintains local state information about what data is stored in the
source and provides such state information to other sources so that
the location of valid data in the shared memory system can be
ascertained. With either scheme, data may need to be copied into
and/or out of different memory sources to maintain coherence, e.g.,
based upon whether a copy of the data has been modified locally
within a particular memory source and/or whether a requester intends
to modify the data once the requester has access to the data. Any
time data is copied into or out of a particular memory source,
however, the memory source is temporarily unavailable and the latency
associated with accessing data stored in the source is increased.

As a result, it is often desirable for performance considerations to
minimize the amount of data transfers, or bandwidth, between memory
sources in a shared memory system. Minimizing data transfers with a
particular memory source increases its availability, and thus reduces
the latency required to access the source.

Many shared memory systems also support the concept of "inclusion",
where copies of cached memory addresses in higher levels of memory
are also cached in associated caches in lower levels of memory. For
example, in the multi-processor environment described above, all
memory addresses cached in the L1 cache for a microprocessor are also
typically cached in the L2 cache for the same microprocessor, as well
as within any shared caches that service the microprocessor.
Consequently, whenever a processor requests data stored in the shared
memory system, the data is typically written into each level of cache
that services the processor.

Inclusion is beneficial in that the number of snoops to lower level
caches can often be reduced given that a higher level cache includes
directory entries for any associated lower level caches. However,
having to write data into multiple memory sources occupies additional
bandwidth in each memory source, which further increases memory
access latency and decreases performance. Furthermore, storing
multiple copies of data in multiple memory sources such as caches
reduces the effective storage capacity of each memory source. With a
reduced storage capacity, hit rates decrease, thus further reducing
the overall performance of a shared memory system. Moreover,
particularly with a snoop-based coherence mechanism, as the number of
memory sources that contain a copy of the same data
increases, the amount of bandwidth occupied by checking and updating
state information and maintaining coherence increases as well.

Therefore, a significant need continues to exist for a manner of
increasing the performance of a shared memory system, particularly to
reduce the bandwidth associated with each memory source and thereby
decrease memory access latency throughout the system.

SUMMARY OF THE INVENTION

The invention addresses these and other problems associated with the
prior art by providing a data processing system, circuit arrangement,
integrated circuit device, program product, and method that utilize
source identification information to selectively route data to
different memory sources in a shared memory system. This permits, for
example, data to be routed to only a portion of the memory sources
associated with a given requester, thereby reducing the bandwidth to
other memory sources and reducing overall latencies within the
system. Consequently, as opposed to inclusion-based designs where
every memory source associated with a given requester receives a copy
of requested data, the routing of data may be more selective to
ensure that data is made available in the most critical memory
sources without necessarily tying up other memory sources for which
the requested data is not particularly critical.

Source identification information may include, for example, an
identification of which memory source and/or which level of memory is
providing the requested data. As an example, it may be desirable to
selectively route requested data to only the L1 cache for a
particular processor, but not its L2 cache, if it is determined that
the requested data is located in the L1 cache for another processor.
By doing so, bandwidth to the L2 cache for the requesting processor
is conserved, and the effective capacity of the L2 cache is increased
since the unnecessary data is not stored in the L2 cache.

Source identification information may also include, for example, an
indication of what processor/requester and/or what type of
instruction last modified the requested data. In this latter
instance, the source identification information could be used, for
example, to enable data to be sent directly to a requester without
occupying any additional memory sources when an accessing instruction
correlates in some fashion with the particular instruction and/or
requester that last modified the data.

Other types of information may also be maintained as source
identification information as will become apparent from the
disclosure hereinafter. Thus, the invention is not limited solely to
the particular source identification information implementations
described herein.

Consistent with one aspect of the invention, a method is provided for
routing data in a multi-requester circuit arrangement including a
plurality of requesters coupled to a plurality of memory sources,
with each requester associated with at least a portion of the
plurality of memory sources. The method includes responding to a
memory request by a first requester among the plurality of
requesters, including providing source identification information
associated with the memory source that is returning the requested
data; and, responsive to the source identification information,
selectively routing the requested data to at least one of the memory
sources associated with the first requester.

Consistent with another aspect of the invention, another method is
provided for routing data in a multi-requester circuit arrangement.
Rather than routing to at least one of the memory sources associated
with a first requester, however, the requested data is routed
directly to the first requester without routing the requested data to
any of the memory sources associated with the first requester
responsive to the source identification information.

These and other advantages and features, which characterize the
invention, are set forth in the claims annexed hereto and forming a
further part hereof. However, for a better understanding of the
invention, and of the advantages and objectives attained through its
use, reference should be made to the Drawings, and to the
accompanying descriptive matter, in which there is described
exemplary embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a data processing system consistent with
the invention.

FIG. 2 is a block diagram of the shared memory architecture for the
data processing system of FIG. 1.

FIG. 3 is a block diagram of a processor integrated circuit device in
the data processing system of FIG. 1.

FIG. 4 is a block diagram of an illustrative response signal
consistent with the invention.

FIG. 5 is a block diagram of another data processing system
consistent with the invention.

FIG. 6 is a flowchart illustrating the logic flow associated with
handling a read request from one of the processing units in the data
processing system of FIG. 5.

DETAILED DESCRIPTION

The illustrated implementations of the invention generally operate by
selectively routing requested data to memory sources associated with
a particular requester in response to source identification
information supplied by the memory source that is providing the
requested data. A requester may be a processor or processing unit, or
any other logic circuitry that utilizes data stored in a shared
memory system, e.g., input/output adapters and/or interfaces, memory
controllers, cache controllers, etc. A memory source, in turn, can
include practically any data storage device or subsystem in a shared
memory system from which identification and/or state information may
be maintained, including main storage and various levels of cache
memories, irrespective of the level of such cache memories, whether
such cache memories are internal or external relative to a processor
or other requester, whether such cache memories are data-only
memories or collective data/instruction memories, whether such cache
memories are dedicated to a particular requester or shared among
several requesters, etc. A memory source can also include other
shared or dedicated memories, including virtual memory, e.g., as
implemented with one or more direct access storage devices in a
page-based memory system. A memory source may also include memories
distributed in a cache-only memory architecture (COMA) or a
non-uniform memory architecture (NUMA) system. Furthermore, a memory
source can also include other buffers or registers that may serve as
a source for data, including translation lookaside buffers, processor
registers, processor buffers, etc.

A memory source is considered to be associated with a particular
requester when the memory source is dedicated to that requester,
i.e., when the memory source services only one requester. A memory
source may also be associated with a requester when the memory source
services that requester along with other requesters, so long as that
memory source is directly accessible by the requester or another
memory source that is dedicated to that requester.
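
By way of illustration only, the association rule stated above (a source dedicated to a requester, or a shared source that is directly accessible by the requester or by a source dedicated to it) may be sketched as a small Python predicate. This is not the disclosed circuitry: the `MemorySource` class, its field names, and the example cache and requester names are hypothetical, and "directly accessible" is modeled simply as a set of names.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySource:
    """Hypothetical model of a memory source (cache, buffer, main storage)."""
    name: str
    # Requesters that this source services.
    services: set = field(default_factory=set)
    # Names of requesters or other sources with a direct path to this source.
    directly_accessible_by: set = field(default_factory=set)

    def is_dedicated_to(self, requester: str) -> bool:
        # Dedicated means the source services exactly one requester.
        return self.services == {requester}


def is_associated(source, requester, all_sources):
    """Return True if `source` is associated with `requester` under the rule above."""
    if source.is_dedicated_to(requester):
        return True
    if requester not in source.services:
        return False
    # Shared source: it needs a direct path from the requester itself or
    # from some source that is dedicated to that requester.
    dedicated = {s.name for s in all_sources if s.is_dedicated_to(requester)}
    return bool(source.directly_accessible_by & ({requester} | dedicated))


# Assumed example: two processing units with private L1 caches and a shared L2.
l1_a = MemorySource("L1-A", {"cpu-a"}, {"cpu-a"})
l1_b = MemorySource("L1-B", {"cpu-b"}, {"cpu-b"})
l2 = MemorySource("L2", {"cpu-a", "cpu-b"}, {"L1-A", "L1-B"})
sources = [l1_a, l1_b, l2]
```

In this assumed arrangement the shared L2 counts as associated with `cpu-a` because it is reachable through `L1-A`, which is dedicated to `cpu-a`, whereas `L1-B` is not associated with `cpu-a` at all.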
As discussed above, source identification information may include,
for example, an identification of which memory source and/or which
level of memory is providing the requested data. Source
identification information may also include, for example, an
indication of what processor/requester and/or what type of
instruction last modified the requested data. Other types of
information, including the state found in the sourcing memory, may
also be used consistent with the invention.

Turning to the Drawings, wherein like numbers denote like parts
throughout the several views, FIG. 1 illustrates the general
configuration of an exemplary data processing system 10 suitable for
selectively routing requested data consistent with the invention.
System 10 generically represents, for example, any of a number of
multi-user computer systems such as a network server, a midrange
computer, a mainframe computer, etc. However, it should be
appreciated that the invention may be implemented in other data
processing systems, e.g., in stand-alone or single-user computer
systems such as workstations, desktop computers, portable computers,
and the like, or in other computing devices such as embedded
controllers and the like. One suitable implementation of data
processing system 10 is in a midrange computer such as the AS/400
computer available from International Business Machines Corporation.

Data processing system 10 generally includes one or more system
processors 12 coupled to a memory subsystem including main storage
14, e.g., an array of dynamic random access memory (DRAM). Also
illustrated as interposed between processors 12 and main storage 14
is a cache system 16, typically including one or more levels of data,
instruction and/or combination caches, with certain caches either
serving individual processors or multiple processors as is well known
in the art. Moreover, as will be discussed below, at least some of
the caches in cache system 16 may be integrated onto the same
integrated circuit devices as one or more of system processors 12.
Furthermore, main storage 14 is coupled to a number of types of
external devices via a system bus 18 and a plurality of interface
devices, e.g., an input/output bus attachment interface 20, a
workstation controller 22 and a storage controller 24, which
respectively provide external access to one or more external networks
26, one or more workstations 28, and/or one or more storage devices
such as a direct access storage device (DASD) 30.

It should be appreciated that data processing system 10 is merely
representative of one suitable environment for use with the
invention, and that the invention may be utilized in a multitude of
other environments in the alternative. The invention should therefore
not be limited to the particular implementations discussed herein.

Selective data routing consistent with the invention is typically
implemented in a circuit arrangement disposed on one or more
programmable integrated circuit devices, and it should be appreciated
that a wide variety of programmable devices may utilize selective
data routing consistent with the invention. Moreover, as is well
known in the art, integrated circuit devices are typically designed
and fabricated using one or more computer data files, referred to
herein as hardware definition programs, that define the layout of the
circuit arrangements on the devices. The programs are typically
generated by a design tool and are subsequently used during
manufacturing to create the layout masks that define the circuit
arrangements applied to a semiconductor wafer. Typically, the
programs are provided in a predefined format using a hardware
definition language (HDL) such as VHDL, Verilog, EDIF, etc. While the
invention has and hereinafter will be described in the context of
circuit arrangements implemented in fully functioning integrated
circuit devices and data processing systems utilizing such devices,
those skilled in the art will appreciate that circuit arrangements
consistent with the invention are capable of being distributed as
program products in a variety of forms, and that the invention
applies equally regardless of the particular type of signal bearing
media used to actually carry out the distribution. Examples of signal
bearing media include but are not limited to recordable type media
such as volatile and non-volatile memory devices, floppy disks, hard
disk drives, CD-ROM's, and DVD's, among others and transmission type
media such as digital and analog communications links.

Data routing consistent with the invention may be centralized within
one or more central routing circuits. In the illustrated
implementation, however, a snoopy coherence mechanism is used, and as
such the data routing circuitry is distributed among the various
requesters and sources, as well as response combining circuitry
(discussed in greater detail below). It will be appreciated that the
specific implementation of the logic discussed hereinafter for the
data routing circuitry would be within the ability of one of ordinary
skill in the art having benefit of the instant disclosure.

The shared memory system represented by data processing system 10
typically includes an addressable memory address space including a
plurality of memory addresses. The actual data stored at such memory
addresses may be maintained at any given time in one or more of
system processors 12, main storage 14, caches 16, DASD 30, and/or
within a workstation 28 or over a network 26. Moreover, for caching
purposes, the memory address space is typically partitioned into a
plurality of cache "lines", which are typically contiguous sequences
of memory addresses that are always swapped into and out of caches as
single units. By organizing memory addresses into defined cache
lines, decoding of memory addresses in caches is significantly
simplified, thereby significantly improving cache performance. By
stating that a sequence of memory addresses forms a cache line,
however, no implication is made whether the sequence of memory
addresses is actually cached at any given time.

As shown in FIG. 2, data processing system 10 implements a shared
memory system incorporating a plurality of nodes 40 interfaced with
main storage 14 over a shared interconnect such as a bus 42
incorporating address lines 44 and data lines 46. A bus arbiter 48
functions as a master for bus 42, in a manner known in the art.

Each node 40 includes a processor integrated circuit device 50 and an
external L3 (tertiary) cache 52. Moreover, as shown in FIG. 3, each
processor integrated circuit device 50 includes one or more
processing units 54, each having a dedicated internal L1 (primary)
data cache 56 associated therewith. The processing units, however,
share an integrated instruction/data L2 (secondary) cache 57, shown
as having an on-board controller/directory 58 coupled to off-chip
memory storage devices 60. Also shown in FIGS. 2 and 3 are
buffers/registers 53, 55 respectively disposed within L3 cache 52 and
processing unit 54 that may function as additional destinations for
data in certain embodiments (discussed below in greater detail).

In the context of the invention, a node may be considered to include
any grouping of memory sources that are associated with one or more
requesters. For example, a node may be defined at the processor or
integrated circuit device level (e.g., within each processor
integrated circuit device 50), at a card or board level, or at a
system level, among others.
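
As a purely illustrative aid, the node arrangement of FIGS. 2 and 3 (processing units with dedicated L1 data caches, a shared on-chip L2 cache, and an external L3 cache per node) may be modeled as a small Python data structure. The class names, field names, and cache naming scheme below are assumptions made for this sketch and do not appear in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingUnit:
    unit_id: int
    l1_data_cache: str  # dedicated L1 (primary) data cache, cf. item 56


@dataclass
class Node:
    node_id: int
    units: list = field(default_factory=list)
    l2_cache: str = ""  # L2 shared by the units on the device, cf. item 57
    l3_cache: str = ""  # external L3 (tertiary) cache, cf. item 52

    def sources_for(self, unit_id: int) -> list:
        """Memory sources that service the given processing unit, nearest first."""
        unit = next(u for u in self.units if u.unit_id == unit_id)
        return [unit.l1_data_cache, self.l2_cache, self.l3_cache]


def build_node(node_id: int, unit_count: int) -> Node:
    """Build one node with `unit_count` processing units (hypothetical naming)."""
    units = [ProcessingUnit(i, f"n{node_id}.L1[{i}]") for i in range(unit_count)]
    return Node(node_id, units, f"n{node_id}.L2", f"n{node_id}.L3")


node = build_node(node_id=0, unit_count=2)
print(node.sources_for(1))  # ['n0.L1[1]', 'n0.L2', 'n0.L3']
```

The list returned by `sources_for` is the set of candidate destinations among which selective routing would choose for a given requester; an inclusion-based design would fill all of them.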
It should be appreciated that the shared memory system of FIGS. 2 and
3 is merely representative of one possible implementation. One
skilled in the art will appreciate that any number and type of
processors, as well as any number, type and level of caches and other
memory sources, may be used in the alternative.

The shared memory system of data processing system 10 is illustrated
as utilizing a snoopy coherence mechanism to permit a number of
requester devices, e.g., each processing unit 54, to issue memory
access requests that may access information stored in any memory
source in the system, e.g., main storage 14 or any L1/L2/L3 cache in
the system. In response to such memory access requests, the snoopy
coherence mechanism updates the state of the memory address(es) in
each memory source that is affected by the memory access requests. A
snoopy coherence mechanism typically includes snoop logic, which
receives memory access requests, determines the state of the memory
address(es) in each memory source that are the subject of the
requests, and outputs suitable local response signals representative
of the states of the various memory sources. In addition, such snoop
logic may also update the state of a memory address in each memory
source in response to the request, as discussed in greater detail
below.

In the illustrated embodiment, the snoop logic for data processing
system 10 is distributed among a plurality of snoop logic circuits
that are each associated with a particular memory source in the
system. The snoopy coherence mechanism in the illustrated
implementation is also implemented as a two-level coherence
mechanism, with coherence maintained both at the processor level, as
well as at the system level.

As shown in FIG. 2, at the system level, each memory source includes
a dedicated snoop logic circuit, including a snoop logic circuit 62
for each processor integrated circuit device 50 and a snoop logic
circuit 64 for each L3 cache 52. Response combining logic 66 is
coupled to each of circuits 62, 64 to combine the responses therefrom
and distribute a combined response to all memory sources.

As shown in FIG. 3, at the processor level, each snoop logic circuit
62 incorporates a plurality of local snoop logic circuits, e.g., a
snoop logic circuit 70 for each L1 data cache 56, and a snoop logic
circuit 72 for L2 cache 57. A local response combining logic circuit
74 interfaces the local snoop logic circuits 70, 72 with the other
system-level snoop logic circuits, as described in greater detail
below.

With the illustrated implementation, the local snoop logic and
response combining logic circuits for each processor maintain
coherence within the L1 and L2 cache memory sources for the
processor, while the system-level snoop logic and response combining
logic circuits maintain coherence between the processors and the L3
caches. However, it should be appreciated that a single-level snoopy
coherence mechanism may be used in the alternative. Moreover, it
should also be appreciated that snoop logic circuits may service
multiple memory sources in the alternative, so a one-to-one mapping
between snoop logic circuits and memory sources is not required.

In general, memory access requests are issued by a requester and
handled first locally within the processor level of the snoopy
coherence mechanism. Should a cache miss occur at the processor
level, a memory access request is then issued over the shared bus 42
and is "snooped" by each system-level snoop logic circuit. Each snoop
logic circuit then interacts with a directory associated therewith to
obtain and/or update the state information regarding a particular
memory address specified by a memory access request. The combination
of snoop logic and a directory or other suitable logic that stores
state information about a particular memory source in data processing
system 10 is also referred to herein as a "snooper" device, which in
some implementations may also be considered to further include the
control logic and/or memory storage for the particular memory source
associated with such a device.

The snoop response collection logic circuits are used to gather local
response signals from the local snoop logic circuits of the various
snooper devices and generate a combined response signal for the local
snooper device. In the illustrated implementation, the functionality
of the snoop response collection logic is distributed between
processor- or chip-level combining logic circuit 74, and the
system-level combining logic circuit 66. As a result, in response to
a particular memory access request, each processor-level circuit
generates a processor response signal from the local response signals
output by the various snoop logic circuits on the processor
integrated circuit device. Then, the system-level circuit collects
the local response signals as well as any additional response signals
(e.g., from each L3 cache 52) and generates therefrom an overall
combined response signal.

The local and combined response signals are used by each of the
sources and the requester to permit data to be sourced and sinked in
a distributed fashion. For example, at the system level, when a
source has the data available, the source requests the data bus from
bus arbiter 48, and once the arbiter grants the bus to the source,
the source places the data on the bus. The requester observes the
data on the bus and recognizes the data by virtue of a tag that
identifies the requester. The requester then receives the data from
the bus. Moreover, other sources (e.g., an L3 cache) associated with
the requester may also detect the tag and receive the data as well.

Each of the local and combined response signals may include any of
the various types of source identification information discussed
above. For example, as shown in FIG. 4, a response signal 80 may be
represented as a data word including a plurality of bits broken into
appropriate fields. Response signal 80 includes a state information
field 82, which may include, for example, any or all of the
conventional MESI states. Response 80 also includes source
identification information 84, including separate fields for source
level information 86, node identification information 88 and
processing unit, or processor, identification information 90. The
source level information typically represents the level of cache
memory that sources the data, e.g., L1, L2, L3 or main memory, and
the node identification information typically represents the
particular node (e.g., at a chip, board or system level) sourcing the
data. The processor identification information typically indicates
what processor last modified the data, and further is used to
distinguish between multiple dedicated cache memories in a particular
cache level (e.g., to distinguish between the multiple L1 caches in
FIG. 3). It should be appreciated that different manners of
identifying sources may be used in the alternative, e.g., simply
assigning each potential source a unique identifier, among others.

Additional source identification information may also be included in
each response, e.g., instruction information 92 from which it can be
determined what instruction last modified the data in the memory. For
example, it may be desirable to indicate when an instruction accesses
data with a lock or semaphore (e.g., with a STCX or LARX instruction
in the PowerPC architecture). In such a circumstance, it is typically
known that the requested data will not be used by
the requesting device after the operation, and it may be
beneficial to simply provide the data directly to the requesting
device (even bypassing all sources), as well as to store
a copy in a lower level cache for immediate access by other
devices (e.g., to an L3 cache).
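
To make the response signal format and its use more concrete, the following Python sketch packs the described fields (state, source level, node, processor, and an instruction indication) into a single data word and then selects destinations from it. It is an assumption-laden illustration rather than the disclosed encoding: the bit widths, the field order, the single `lock` bit standing in for the instruction information, and the fallback policy in `destinations` are all hypothetical.

```python
from enum import IntEnum


class State(IntEnum):  # the conventional MESI states
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3


class Level(IntEnum):  # memory level that sources the data
    L1 = 0
    L2 = 1
    L3 = 2
    MAIN = 3


# Assumed field widths in bits, lowest-order field first.
FIELDS = (("state", 2), ("level", 2), ("node", 4), ("processor", 4), ("lock", 1))


def pack(**values) -> int:
    """Encode the named fields into one response word; missing fields are 0."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = int(values.get(name, 0))
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word: int) -> dict:
    """Decode a response word back into its fields."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out


def destinations(word: int) -> list:
    """Choose where returned data is routed for the requesting processing unit."""
    info = unpack(word)
    if info["lock"]:
        # Lock/semaphore case: deliver directly to the requester and keep
        # a copy in a lower level cache such as the L3 cache.
        return ["requester", "L3"]
    if info["level"] == Level.L1:
        # Sourced by another processor's L1: fill only the requester's L1
        # and leave its L2 untouched.
        return ["requester", "L1"]
    # Fallback chosen for this sketch only: inclusion-style fill.
    return ["requester", "L1", "L2", "L3"]
```

For instance, `destinations(pack(state=State.MODIFIED, level=Level.L1, node=1, processor=2))` yields only the requester and its L1 cache, which corresponds to the selective routing example given in the summary.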

It should be appreciated that each response signal is
typically transmitted in parallel and implemented using a
plurality of lines that separately carry each bit encoded in the 
signal. Other manners of encoding a response signal, e.g., 
serial, may also be used. Moreover, it should also be 
appreciated that the source identification information carried 
by the various response signals throughout a data processing
system can vary from one another, particularly when certain
information about a response signal source is inherently
known by the recipient of the signal. As an example, the
processor-level snoop response collection logic typically
will receive the local response signals from each of the
processing units, L1 caches and L2 caches via separate
inputs, and as such will be able to determine which memory 
is the source for the requested data without having such 
information provided by that source in its response signal. 
Also, the node within which the logic circuit is disposed is 
also known. Thus, local response signals from each 
processor-level memory may not need to include node, 
processing unit and/or memory level information to the 
collection logic circuit. 

It should be appreciated that the general implementation 
of a snoopy coherence mechanism is understood in the art. 
Moreover, other coherence mechanisms, e.g., directory- 
based mechanisms, may also be used in the alternative. 
Thus, the invention should not be limited to use with the 
particular snoopy coherence mechanisms described herein. 

The operational logic for implementing selective data 
routing consistent with the invention is typically distributed 
among the various sources and response combination logic. 
Moreover, implementation of such functionality would be 
apparent to one of ordinary skill in the art having the benefit 
of the disclosure presented herein. To simplify such an 
understanding, a specific exemplary data processing system 
100 is illustrated in FIG. 5, including two nodes 102, 104, 
each including a respective processor integrated circuit 
device 106, 108. 

Device 106 is implemented as a two-processing unit 
device, including processing units 110, 112 respectively 
serviced by dedicated L1 caches 114, 116. A shared L2 cache
118 services each processing unit 110, 112. Similarly, device
108 includes two processing units 120, 122 respectively
serviced by dedicated L1 caches 124, 126, and sharing a
shared L2 cache 128. Each node 102, 104 further includes an 
L3 cache 130, 132, with each device 106, 108 and L3 cache 
130, 132 interfaced with a main memory 134 over a shared 
bus 136. 

System-level response combining logic is illustrated at 
138, with the additional snoop/combining logic disposed 
within each cache and processor device not shown separately.
With two nodes, four levels of memory (L1, L2, L3
and main memory), and two processors in each node, it
should thus be appreciated that each potential source in the
system can be represented in a response signal via a 2-bit
level identifier, a 1-bit node identifier, and a 1-bit processor
identifier, for a total of 4 bits of source identification
information. However, to simplify the discussion hereinafter,
rather than identifying each device/memory by a combination
of level, node and/or processor information, the various
processing units and caches in FIG. 5 are assigned unique
numerical identifiers, including processing units (PUs)
0 . . . 3, L1 caches 0 . . . 3, L2 caches 0 . . . 1 and L3 caches
0 . . . 1.
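
The 4-bit source identifier described above can be sketched as follows. This is a minimal illustration only: the text fixes the field widths (2-bit level, 1-bit node, 1-bit processor) but not their order, so the bit layout used here, and the names `Level`, `encode_source` and `decode_source`, are assumptions made for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """Memory level; four values fit the 2-bit level identifier."""
    L1 = 0
    L2 = 1
    L3 = 2
    MAIN = 3


def encode_source(level: Level, node: int, processor: int) -> int:
    """Pack level (2 bits), node (1 bit) and processor (1 bit) into 4 bits.

    The field order (level, node, processor) is assumed for this sketch.
    """
    if node not in (0, 1) or processor not in (0, 1):
        raise ValueError("node and processor must each fit in one bit")
    return (int(level) << 2) | (node << 1) | processor


def decode_source(code: int) -> tuple[Level, int, int]:
    """Unpack a 4-bit source identifier into (level, node, processor)."""
    if not 0 <= code <= 0b1111:
        raise ValueError("source identifier must fit in 4 bits")
    return Level(code >> 2), (code >> 1) & 1, code & 1
```

For instance, under this assumed layout an L3 cache in node 1 would be encoded as binary 1010, and decoding that value recovers the same three fields.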

It is assumed for the purposes of this exemplary embodiment
that at least four states are supported, including invalid
(I), shared (S), modified (M) and tag (T), which represents
data that has been modified and must be written to memory
sometime in the future. In addition, a fifth state, allocate (A),
is also supported, whereby a directory entry is allocated for
a particular cache line, but the actual data is not written to
the cache. With this latter, optional state, bandwidth is 
conserved since the data need not be immediately written to 
the allocated entry in the cache. Space is still reserved in the 
cache, however, for that entry. 
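
The effect of the optional allocate state can be illustrated with a toy model. The sketch below is hypothetical throughout: the `CacheModel` class, its `capacity` limit and the `bytes_written` counter (a crude stand-in for bandwidth) are assumptions introduced for the example and are not part of the described circuitry.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"
    TAG = "T"
    ALLOCATE = "A"


class CacheModel:
    """Toy cache: a fixed number of directory entries, each holding a
    state and, optionally, the cache line data."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: dict[int, tuple[State, Optional[bytes]]] = {}
        self.bytes_written = 0  # stand-in for bandwidth consumed

    def _reserve(self, address: int) -> None:
        # A new address needs a free directory entry.
        if address not in self.entries and len(self.entries) >= self.capacity:
            raise MemoryError("no free directory entry")

    def allocate(self, address: int) -> None:
        """Reserve a directory entry without writing the line's data."""
        self._reserve(address)
        self.entries[address] = (State.ALLOCATE, None)

    def fill(self, address: int, data: bytes, state: State) -> None:
        """Write the line's data into the cache and record its state."""
        self._reserve(address)
        self.entries[address] = (state, data)
        self.bytes_written += len(data)
```

In this model an allocated entry occupies one of the cache's directory slots, so space remains reserved, while `bytes_written` stays unchanged until the line is actually filled.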

A state transition table resulting from a read request from
processing unit PU(0) 108 is shown below in Table I,
indicating the state transitions that occur as a result of
issuing the read request when the data is stored in various
memory sources in the data processing system, and based
upon whether the data is modified in that source:

TABLE I 

Cache State Transitions Resulting from PU(0) Read Request

The nomenclature "x/y" indicates that a transition occurs
from state x to state y in the particular source as a result of 
the operation. 

In general, the state transitions illustrated in Table I route
data selectively within a node based upon the level of
memory that is sourcing the request. Thus, whenever data is
sourced from another L1 memory, the data is routed only to
the requester's L1 cache. Whenever data is sourced from an
L2 memory, the data is routed only to the requester's L1 and
L2 caches, and whenever data is sourced from an L3 memory,
the data is routed to each of the L1, L2 and L3 caches
associated with the requester. Data sourced from main
memory may be routed to each of the L1, L2 and L3 caches,
or possibly it may be desirable to omit routing the data to the
L2 cache in such a circumstance.
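
The level-based routing rule stated above can be written as a small lookup. This is a sketch under stated assumptions: the function name `destinations_for`, the string labels for levels, and the `skip_l2_for_main` flag (modelling the optional omission of the L2 cache for data from main memory) are hypothetical.

```python
def destinations_for(source_level: str, skip_l2_for_main: bool = False) -> list[str]:
    """Return the requester-side cache levels that receive data
    sourced from the given memory level."""
    routes = {
        "L1": ["L1"],                 # sourced from another L1
        "L2": ["L1", "L2"],
        "L3": ["L1", "L2", "L3"],
        "MAIN": ["L1", "L2", "L3"],
    }
    if source_level not in routes:
        raise ValueError(f"unknown memory level: {source_level!r}")
    targets = list(routes[source_level])
    if source_level == "MAIN" and skip_l2_for_main:
        targets.remove("L2")          # optional variant described in the text
    return targets
```

The higher the level of memory that sources the data, the fewer requester-side caches are written, which is where the bandwidth saving arises.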

The logic flow that implements the transition rules set
forth in Table I is represented at 160 in FIG. 6. It will be
appreciated that while the logic flow is shown as occurring
in sequence, the logic is typically distributed among multiple
logic circuits that operate independently and concurrently
with one another. The sequential representation
shown in FIG. 6 is merely presented to simplify the explanation
of the operation of the logic circuitry. For example,
determination of whether a request hits or misses different
caches in a processor and/or system typically occurs in
parallel by all caches and other sources of memory.

As shown in FIG. 6, it is first determined in block 162
whether a hit occurs in L1(0), the L1 cache associated with
PU(0). If so, no additional processing by the shared memory
system is required. If a miss occurs, however, block 164
determines whether a hit occurs in L1(1), the other L1 cache
on the same processor integrated circuit device as PU(0). If
so, block 166 next determines whether the cache line for the
requested data is modified in L1(1). If so, in block 168 the
data is sourced (retrieved) from L1(1), and the state of the
cache line transitions from modified to shared. Next, block
170 writes the requested cache line into L1(0) and transitions
the cache line from invalid to tag. Processing of the
request is then complete.

Returning to block 166, if the data is not modified, block
172 sources the cache line from L1(1) and transitions L1(1)
to the shared state. Next, block 174 writes the cache line into
L1(0) and transitions the cache to the shared state as well,
indicating that both L1(0) and L1(1) include valid copies of
the cache line. Processing of the request is then complete.

Returning to block 164, if the request misses L1(1), block
176 determines whether a hit occurs in L2(0), the L2 cache
on the same processor integrated circuit device as PU(0). If
so, block 178 next determines whether the cache line for the
requested data is modified in L2(0). If so, in block 180 the
data is sourced from L2(0), and the state of the cache line
transitions from modified to tag. Next, control passes to
block 174 to write the cache line into L1(0) and transition
the cache line in L1(0) to the shared state. Processing of the
request is then complete. Returning to block 178, if the data
is not modified, block 182 sources the cache line from L2(0)
and transitions L2(0) to the shared state. Next, block 174
writes the cache line into L1(0) and transitions the cache to
the shared state. Processing of the request is then complete.

Returning to block 176, if it is determined that the request
misses each of L1(0), L1(1) and L2(0), the request cannot be
handled within the processor integrated circuit device, and
as such the request must be fulfilled either by main memory
or by the other processor device. Accordingly, block 184
broadcasts the request, specifically the requested address
and a request type (e.g., read, read with intent to modify,
write, claim, etc.) over system bus 136 (FIG. 5), which is in
turn snooped by each of the memory sources coupled to the
bus (here processor integrated circuit device 108 and L3
caches 130, 132). Snoop logic within device 108 further
broadcasts appropriate request information to each of the
sources within the device, in essentially the same manner as
a conventional multi-level snoopy coherence protocol.

As shown in block 186, it is next determined whether the
broadcast request has hit L3(0), the L3 cache associated with
PU(0). If so, block 188 determines whether the data has been
modified. If so, block 190 sources the cache line for the
requested data from L3(0) and transitions L3(0) from the
modified to the tag state. Next, blocks 192 and 193 write the
cache line respectively into L2(0) and L1(0) and transition
each cache to the shared state, essentially maintaining inclusion
in the node for PU(0). Processing of the request is then
complete. Returning to block 188, if the data is not modified,
block 194 sources the cache line for the requested data from
L3(0) and transitions L3(0) to the shared state, and blocks
192 and 193 write the cache line respectively into L2(0) and
L1(0) and transition each cache to the shared state. Processing
of the request is then complete.

Returning to block 186, if L3(0) is not hit, block 196
determines whether the broadcast request has hit the other
L3 cache, L3(1). If so, block 198 determines whether the
data has been modified. If so, block 200 sources the cache
line for the requested data from L3(1) and transitions L3(1)
from the modified to the shared state. Next, block 202 writes
the cache line into L3(0) and transitions L3(0) from the
invalid state to the tag state. Control then passes to blocks
192 and 193 to write the cache line respectively into L2(0)
and L1(0) and transition each cache to the shared state,
completing handling of the request. And returning to block
198, if the data is not modified, block 204 sources the cache
line for the requested data from L3(1) and transitions L3(1)
to the shared state, and blocks 206, 192 and 193 write the
cache line respectively into L3(0), L2(0) and L1(0) and
transition each cache to the shared state. Processing of the
request is then complete.

Returning to block 196, if L3(1) is not hit, block 208 next
determines whether the request has hit the other L2 cache,
L2(1). If so, block 210 determines whether the data has been
modified. If so, block 212 sources the cache line for the
requested data from L2(1) and transitions L2(1) from the
modified to the shared state. However, given that the data
was stored in an L2 cache rather than an L3 cache, an
assumption is made that the data is relatively "warm" (more
frequently accessed), and rather than writing the data into
the L3 cache, the L3 cache is bypassed, and the data is
written directly into L2(0) in block 214, including transitioning
L2(0) from the invalid state to the tag state. Control then
passes to block 193 to write the cache line into L1(0) and
transition that cache to the shared state, completing handling
of the request. And returning to block 210, if the data is not
modified, block 216 sources the cache line for the requested
data from L2(1) and transitions L2(1) to the shared state, and
blocks 218 and 193 write the cache line respectively into
L2(0) and L1(0) and transition each cache to the shared
state. Processing of the request is then complete. As such, it
can be seen that, when data is sourced from an L2 cache in
this example, bandwidth and storage space associated with
the L3 cache for the associated processing unit are conserved.

Returning to block 208, if L2(1) is not hit, block 220 next
determines whether the request has hit either L1 cache in the
other processor integrated circuit device, L1(2) or L1(3). If
so, block 222 determines whether the data has been modified.
If so, block 224 sources the cache line for the requested
data from the appropriate L1 cache, L1(2) or L1(3), and
transitions such cache from the modified to the shared state.
However, given that the data was stored in an L1 cache
rather than an L2 or L3 cache, an assumption is made that
the data is relatively "hot" (most frequently accessed), and
rather than writing the data into the L2 and L3 caches, the
L2 and L3 caches are bypassed, and the data is written
directly into L1(0) in block 226, including transitioning
L1(0) from the invalid state to the tag state. Processing of the
request is then complete. And returning to block 222, if the
data is not modified, block 228 sources the cache line for the
requested data from L1(2) or L1(3) and transitions such
cache to the shared state, and block 193 writes the cache line
into L1(0) and transitions the cache to the shared state.
Processing of the request is then complete. As such, it can be
seen that, when data is sourced from an L1 cache in this
example, bandwidth and storage space associated with both
the L2 and L3 caches for the associated processing unit are
conserved.

Returning again to block 220, if the request does not hit
any cache, block 230 sources the request from main
memory. The L3 and L1 caches associated with the requesting
processing unit PU(0), L3(0) and L1(0), are then written
to and the states thereof are transitioned to the shared state,
completing processing of the request. In this circumstance,
however, bandwidth and storage space associated with the
L2 cache are conserved.
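
The outcomes of the FIG. 6 flow for a PU(0) read, as described in the preceding paragraphs, can be condensed into a lookup. This is only a sketch of one reading of the prose: the function name `read_transitions`, the string labels (with "MEM" standing for main memory), and the rule used to place the tag state are assumptions, and the actual logic is distributed among concurrently operating circuits rather than evaluated sequentially.

```python
# Requester-side caches written for each possible source of the line,
# listed from the lowest level written up to L1(0).
FILLS = {
    "L1(0)": [],
    "L1(1)": ["L1(0)"],
    "L2(0)": ["L1(0)"],
    "L3(0)": ["L2(0)", "L1(0)"],
    "L3(1)": ["L3(0)", "L2(0)", "L1(0)"],
    "L2(1)": ["L2(0)", "L1(0)"],   # L3(0) bypassed: data assumed "warm"
    "L1(2)": ["L1(0)"],            # L2(0), L3(0) bypassed: data assumed "hot"
    "L1(3)": ["L1(0)"],
    "MEM": ["L3(0)", "L1(0)"],     # L2(0) bypassed
}

# Sources that themselves service PU(0); when modified they keep the tag state.
OWN_HIERARCHY = {"L2(0)", "L3(0)"}


def read_transitions(source: str, modified: bool = False) -> dict[str, str]:
    """Return the resulting state ("S" shared, "T" tag) of every cache
    touched by a PU(0) read that is satisfied by `source`."""
    if source not in FILLS:
        raise ValueError(f"unknown source: {source!r}")
    if source == "L1(0)":
        return {}                  # local hit, nothing else is involved
    fills = FILLS[source]
    states = {cache: "S" for cache in fills}
    if source != "MEM":
        states[source] = "S"
        if modified:
            # The tag state stays with the source when it services PU(0);
            # otherwise it moves to the lowest-level cache that is written.
            holder = source if source in OWN_HIERARCHY else fills[0]
            states[holder] = "T"
    return states
```

For example, a modified line found in L2(1) leaves L2(1) shared, places L2(0) in the tag state and L1(0) in the shared state, and never touches L3(0).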

Similar to Table I, Table II below shows illustrative state
transitions responsive to a read with intent to modify
(RWITM) request issued by processing unit PU(0). As is 
known in the art, a RWITM request is often issued to retrieve 
data to a local cache for a processing unit for subsequent 
modification by the processing unit. 

TABLE II 

Cache State Transitions Resulting from PU(0) RWITM Request

In this implementation, the allocated (A) state indicates
that an entry in the appropriate source is allocated, but that 
the actual data is not written into that entry in the source. As 
a result, bandwidth to that source is conserved. In the 
alternative, either allocating an entry can be omitted, or 
inclusion may be utilized to maintain additional copies in 
other cache levels. 

It should be appreciated that the logic flow that imple- 
ments the transition rules set forth in Table II would proceed 
in much the same manner as the logic flow illustrated for
Table I in FIG. 6, with appropriate substitutions made based
upon the state transitions set forth in Table II. Moreover, it
should be appreciated that additional logic may be required 
to fully implement a shared memory system with selective 
data routing as described herein. For example, state transi- 
tion tables similar to Tables I and II may be developed to 
handle read and RWITM requests from each of processors 
PU(1), PU(2) and PU(3), as well as to handle other requests 
that may be made in the specific system. Furthermore, 
additional logic may be required to implement appropriate 
state transitions when the initial states of some sources differ 
from those set forth in Tables I and II. Moreover, as 
discussed above, different numbers and arrangements of
processing units, cache memories, shared memories, etc. 
may also be used, which would necessarily require custom- 
ized logic circuitry to handle selective data routing in the 
manner described herein. However, it would be well within 
the abilities of the ordinary artisan having the benefit of the 
instant disclosure to implement any of such customized 
logic circuitry to implement desired functionality consistent 
with the present invention. 

Selective data routing consistent with the invention has a 
number of unique advantages over conventional designs. 
Bandwidth, and optionally storage (when entries are not
allocated) in caches and other memory sources may be 
conserved by reducing the amount of redundant data that is 
maintained in multiple caches. 

Moreover, various modifications may be made to the 
illustrated embodiments without departing from the spirit
and scope of the invention. For example, it may also be
beneficial to utilize source identification information to
selectively invalidate cache lines in non-sourcing caches. As
one illustrative embodiment, it may be desirable to utilize
source identification information to detect when a source in
a particular node is sourcing a request from another node, so
that the other memories in the sourcing node can be invalidated
along with the specific source for the requested data.

In addition, in some implementations it may be desirable 
to support the retrieval of data directly to a requester, 
without storing the data in any external memory source 
associated with that requester. For example, it may be 
desirable to completely bypass any intervening memory 
sources (e.g., any of the L1, L2 and L3 caches associated
with a processing unit), and instead forward data directly to 
a processing unit, responsive to a request that responds with 
a "not-cached" source identification (e.g., data that is found 
in a register/buffer internal to an L3 cache controller, or in 
another processor's register/buffer, such as respectively 
shown at 53 and 55 in FIGS. 2 and 3). It may also be 
desirable to do so when requested data is found in another 
processing unit's L1 cache, and modified by a specific
instruction (e.g., a STCX instruction in the PowerPC 
architecture). In either event, if the data was found modified, 
it would be desirable to mark the data in the tag state. A 
directory associated with any processing unit's registers/ 
buffers would also be used to track such variables to permit 
the registers/buffers to be snooped if another requester 
requests the same data. Once such data was used, the data (if 
modified) may be written to a fast access buffer in the L3 
cache controller so that the buffer can readily provide the 
data for use by any requester when it is again requested. 
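
The two conditions under which data might be forwarded directly to the requester can be expressed as a simple predicate. The sketch below is illustrative only: the `SourceInfo` record, its field names, and the default set of bypass instructions are hypothetical, and a real implementation would derive this information from the source identification carried in the response signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceInfo:
    cached: bool                         # False models a "not-cached" source (register/buffer)
    level: Optional[str] = None          # "L1", "L2" or "L3" when cached
    last_modifier: Optional[str] = None  # instruction that last modified the data, if known
    modified: bool = False


def route_directly(info: SourceInfo,
                   bypass_instructions: frozenset = frozenset({"STCX"})) -> bool:
    """Decide whether to bypass the requester's caches entirely."""
    if not info.cached:
        return True                      # "not-cached" source identification
    # Data found in another processing unit's L1 cache and modified
    # by one of the designated instructions.
    return (info.level == "L1"
            and info.modified
            and info.last_modifier in bypass_instructions)
```

Data found in an L2 or L3 cache, or in an L1 cache but not modified by a designated instruction, would under this sketch follow the ordinary selective routing instead.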

As another alternative, a combined response may be 
selectively provided to only a subset of the memory sources, 
e.g., simply to the requester device. Also, other mechanisms 
may be used to determine where to selectively route data, 
e.g., based upon other concerns such as available bandwidth 
in each potential memory, or various "fairness" algorithms. 

Other modifications will be apparent to one of ordinary 
skill in the art. Therefore, the invention lies in the claims 
hereinafter appended. 

What is claimed is: 

1. A method of routing data in a multi-requester circuit 
arrangement including a plurality of requesters coupled to a 
plurality of memory sources, wherein each requester is 
associated with at least a portion of the plurality of memory 
sources, the method comprising: 

(a) responding to a memory request by a first requester 
among the plurality of requesters, including providing 
source identification information associated with the 
memory source that is returning the requested data; and 

(b) responsive to the source identification information,
selectively routing the requested data to only a subset 
of the memory sources associated with the first 
requester. 

2. The method of claim 1, wherein the plurality of 
requesters includes first and second processing units. 

3. The method of claim 2, wherein the plurality of
memory sources are organized into at least first and second
levels, the first level including first and second memory
sources respectively associated with the first and second
processing units, and the second level including a third
memory source shared by the first and second processing
units.

4. The method of claim 3, wherein the first and second
memory sources are primary cache memories, the plurality
of sources further including first and second secondary
cache memories respectively associated with the first and
second processing units.

5. The method of claim 1, wherein the source identification
information includes a memory level indicator that
indicates a level of memory sourcing the requested data.

6. The method of claim 5, wherein at least one of the
plurality of requesters is a processing unit, and wherein the
source identification information further includes a processing
unit indicator that identifies a processing unit if any from
the plurality of requesters that modified the requested data.

7. The method of claim 5, wherein the source identification
information further includes an instruction indicator that
identifies an instruction if any that modified the requested
data.

8. The method of claim 1, wherein the memory sources
associated with the first requester includes a cache memory,
the method further comprising allocating a directory entry in
the cache memory without storing the requested data in the
cache memory responsive to the source identification information.

9. The method of claim 1, wherein providing the source
identification information associated with the memory
source for the requested data includes:

(a) generating in each of at least a portion of the memory
sources a coherency response, at least one of the
coherency responses including the source identification
information; and

(b) generating a combined response from the coherency
responses, wherein selectively routing the requested
data is responsive to the combined response.

10. The method of claim 9, further comprising transmitting
the combined response at least to the memory source
that is returning the requested data.

11. The method of claim 1, further comprising invalidating
data stored in at least one memory source responsive to
the source identification information.

12. The method of claim 1, further comprising selectively
routing the requested data directly to the first requester
responsive to the source identification information.

13. A method of routing data in a multi-processor circuit
arrangement including first and second processors, each
processor coupled to and associated with a plurality of
memories, the method comprising:

(a) responding to a memory request by the first processor
by outputting requested data from one of the plurality
of memories associated with the second processor,
including indicating which of the plurality of memories
associated with the second processor is sourcing the
requested data; and

(b) selectively routing the requested data to only a subset
of the plurality of memories associated with the first
processor based upon which of the plurality of memories
associated with the second processor is sourcing
the requested data.

14. The method of claim 13, wherein each of the first and
second processors is associated with at least primary and
secondary cache memories.

15. The method of claim 14, wherein each of the first and
second processors is further associated with a tertiary cache
memory.

16. The method of claim 14, further comprising allocating
a directory entry in at least one of the primary and secondary
cache memories associated with the first processor without
storing the requested data therein based upon which of the
plurality of memories associated with the second processor
is sourcing the requested data.

17. The method of claim 14, further comprising invalidating
data stored in at least one of the primary and
secondary cache memories associated with the second processor
based upon which of the plurality of memories
associated with the second processor is sourcing the
requested data.

18. A method of routing data in a multi-requester circuit
arrangement including a plurality of requesters coupled to a
plurality of memory sources, wherein each requester is
associated with at least a portion of the plurality of memory
sources, the method comprising:

(a) responding to a memory request by a first requester
among the plurality of requesters, including providing
source identification information associated with the
memory source that is returning the requested data; and

(b) responsive to the source identification information,
selectively routing the requested data directly to the
first requester without routing the requested data to any
of the memory sources associated with the first
requester.

19. A circuit arrangement, comprising:

(a) a plurality of memory sources;

(b) a plurality of requesters coupled to the plurality of
memory sources, each requester associated with at least
a portion of the plurality of memory sources; and

(c) a data routing circuit configured to selectively route
data requested by the first requester to only a subset of
the memory sources associated with the first requester
responsive to source identification information provided
by a memory source that is returning the
requested data.

20. The circuit arrangement of claim 19, wherein the
plurality of requesters includes first and second processing
units, and wherein the plurality of memory sources are
organized into at least first and second levels, the first level
including first and second memory sources respectively
associated with the first and second processing units, and the
second level including a third memory source shared by the
first and second processing units.

21. The circuit arrangement of claim 19, wherein the
source identification information includes at least one of a
memory level indicator that indicates a level of memory
sourcing the requested data, a processing unit indicator that
identifies a processing unit if any that modified the requested
data, and an instruction indicator that identifies an instruction
if any that modified the requested data.

22. The circuit arrangement of claim 19, wherein the
memory sources associated with the first requester includes
a cache memory, and wherein the data routing circuit is
further configured to allocate a directory entry in the cache
memory without storing the requested data in the cache
memory responsive to the source identification information.

23. The circuit arrangement of claim 19, further comprising:

(a) a snoop logic circuit configured to generate in each of
at least a portion of the memory sources a coherency
response, at least one of the coherency responses
including the source identification information; and

(b) a response combining logic circuit configured to
generate a combined response from the coherency
responses, wherein the data routing circuit is responsive
to the combined response.

24. The circuit arrangement of claim 23, wherein the
response combining logic circuit is configured to transmit
the combined response at least to the memory source that is
returning the requested data.

25. The circuit arrangement of claim 19, wherein the data
routing circuit is further configured to invalidate data stored
in at least one memory source responsive to the source
identification information.

26. The circuit arrangement of claim 19, wherein the data
routing circuit is further configured to selectively route the
requested data directly to the first requester responsive to the
source identification information.

27. A data processing system, comprising:

(a) a plurality of memory sources; 

(b) a plurality of requesters coupled to the plurality of 
memory sources, each requester associated with at least 
a portion of the plurality of memory sources; and 

(c) a data routing circuit configured to selectively route 
data requested by the first requester to only a subset of 
the memory sources associated with the first requester 
responsive to source identification information pro- 
vided by a memory source that is returning the 
requested data. 